Speech synthesis system

ABSTRACT

In a system in which a plurality of previously recorded waveforms corresponding to phonetic elements separately picked up from natural voice and having a pitch length, are connected to form any required speech, the degradation in the quality of the synthesized speech due to the discontinuity in the waveform of the synthesized speech is prevented by so controlling the period of reading out each phonetic element as to change the period stepwise at intervals of several phonetic elements (i.e., pitch lengths).

United States Patent 3,892,919
Ichikawa
July 1, 1975

[54] SPEECH SYNTHESIS SYSTEM
[75] Inventor: Akira Ichikawa, Kokubunji, Japan
[73] Assignee: Hitachi, Ltd., Japan
[22] Filed: 1973
[21] Appl. No.: 414,746
[30] Foreign Application Priority Data: Nov. 13, 1972, Japan, 47412995
[52] U.S. Cl.: 179/1 SM
[51] Int. Cl.: G10l 1/00
[58] Field of Search: 179/1 SM; 340/143 152
[56] References Cited:
2,771,509 11/1956 Dudley 179/1.5 M
3,369,077 2/1968 French 179/1.5 M
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: E. S. Matt Kemeny
Attorney, Agent, or Firm: Craig & Antonelli

3 Claims, 7 Drawing Figures

SPEECH SYNTHESIS SYSTEM

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech synthesis system, and more particularly to a system in which a sound wave extracted from natural voice and having about a pitch length is used as a phonetic segment or speech segment, and in which the previously stored phonetic segments are selectively connected at controlled periods in accordance with control signals corresponding to a required word or sentence to be synthesized.

2. Description of the Prior Art

In recent years, information service systems which connect data processing devices such as electronic computers with communication lines such as telephone lines have been developed. In such a system, a remote subscriber's question sent through a communication line is received by a central signal processing device which stores a large amount of information, and the device prepares an answer to the question and sends it back to the subscriber, the answer being in the form of sound like a human voice.

In this system, the most important part is the speech synthesis part, which produces the answer in the form of voice.

The requirements for the speech synthesis part are as follows: the synthesized speech must be as near the human voice as possible; the production cost must be low; and the system incorporating the part must permit multiple uses, that is, the part must be able to generate a plurality of speech outputs at a time.

In a conventional speech synthesis system which is rather satisfactory from the standpoint of the above-mentioned requirements, a plurality of sound waveforms each having a pitch length are previously prepared so as to be used as speech sound waveforms, i.e. speech segments, and the speech segments are selectively connected in accordance with control signals corresponding to the words or sentences to be synthesized.

This conventional system is rather cheap, since any desired speech can be synthesized by connecting speech segments each having a waveform of a pitch length, so that the number of stored speech segments is relatively small. The speech segments can be read out rapidly, that is, the access time is very short, so that the multiple synthesis of speech is possible.

Moreover, the read-out time of a speech segment, that is, the length of the waveform of the speech segment, can be controlled, so that the pitch of the synthesized speech can also be controlled.

Although the conventional system has several merits as mentioned above, the inventor's experiments have revealed that the speech synthesized by the conventional system suffers from hoarse noises and that its vocal quality is very poor. The cause of this drawback is as follows. In this speech synthesis system, connected speech is formed by connecting the waveforms of speech segments, and therefore a discontinuity, i.e. a rapid change in amplitude, is caused at the junction between any two adjacent speech segment waveforms. Such discontinuities appear every pitch period (equal to the fundamental period of speech, whose repetition rate lies in the audible range of frequencies) and generate hoarse noises in the synthesized speech.

SUMMARY OF THE INVENTION

One object of the present invention is to improve the quality of the synthesized speech produced by a speech synthesis system in which a plurality of speech sound waveforms, each having a pitch length, to be used as speech segments are recorded and these speech segments are selectively connected to form synthesized speech.

Another object of the present invention is to provide a speech synthesis system in which a plurality of speech sound waveforms, each having a pitch length, to be used as speech segments are recorded and these speech segments are selectively connected to form synthesized speech, and in which the pitch control of speech sounds is simplified so that the system can be economically fabricated without deterioration in the vocal quality of the synthesized speech.

According to the present invention, which has been made to attain the above objects, in a speech synthesizing system in which speech segments, each having a pitch length, are selectively connected to synthesize desired speech, the time of reading out each speech segment, that is, the length of the waveform of each speech segment of the synthesized speech, is changed stepwise at intervals of several speech segments. Namely, the lengths of the waveforms of the speech segments read out are changed at intervals of a quarter of a syllable to a full syllable. Therefore, the system according to the present invention can produce synthesized speech softer to the ear than that produced by a conventional speech synthesis system in which the length of the waveform of every speech segment is controlled individually.

Other objects, features and advantages of the present invention will be made apparent when one reads the following part of the specification with the aid of the attached drawings.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is an oscillographic representation of a monosyllable speech sound waveform.

FIG. 2 shows the modes of variation in the pitch frequency of monosyllable speech sounds in various pronunciations.

FIG. 3 shows the variations in the pitch frequency of one word.

FIGS. 4A and 4B show waveforms illustrating the discontinuities resulting from the connection of separate speech segments.

FIG. 5 shows the variation in pitch frequency of the synthesized speech formed by the speech synthesizing system according to the present invention.

FIG. 6 is a block diagram of a speech synthesis system embodying the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In FIG. 1, the waveform of a monosyllable speech sound is shown in a rectangular coordinate system in which the abscissa represents the time base and the ordinate gives the amplitude of the waveform. As seen from FIG. 1, the waveform of the monosyllable speech sound consists of an irregular portion C like that of a consonant and a periodical portion V like that of a vowel. In particular, every syllable of Japanese speech is composed either of a single consonant followed by a single vowel or of a single vowel, and about one hundred different syllables can make up all the speech sounds covering the entire vocabulary of the Japanese language. Of the portions of the waveform shown in FIG. 1, the more important is the periodical portion V, which occupies most of the monosyllable speech sound waveform and determines the pitch, intonation and tone (indicating the kind of syllable) of the speech sound.

Namely, the pitch or intonation of the speech sound depends mainly on the repetition periods T1, T2, T3, ..., i.e. the pitch periods, while the tone is determined by the frequency characteristic of the periodical portion V. The pitch period is usually 10 to ... milliseconds.

FIG. 2 shows the variation with time in the pitch frequency (defined as the reciprocal of the pitch period) of the monosyllable speech sound shown in FIG. 1. In FIG. 2, the abscissa and the ordinate respectively represent the time base and the pitch frequency. When a monosyllable speech sound is pronounced individually, it has a characteristic curve 1, convex upward, as shown in FIG. 2. However, when the same speech sound is pronounced in a word or sentence, it may assume characteristic curves 2, 3 and 4, corresponding respectively to level, rising and falling intonation, depending upon the position it assumes in the word or sentence or upon the kind of word or sentence.

Accordingly, in the case where the connected speech sounds corresponding to a desired word or sentence are formed by connecting together the prerecorded speech segments, i.e. speech sound waveforms obtained by dividing the waveform of the monosyllable speech sound shown in FIG. 1, pronounced in a manner corresponding to the curve 1 in FIG. 2, into units each having a pitch length, discontinuities are formed at the junction points between the unit waveforms, i.e. speech segment waveforms, the discontinuities being the portions where the amplitude of the waveform changes rapidly.

Such discontinuities will be described in further detail. FIG. 3 shows the variation with time in the pitch frequency of the speech sound corresponding to a word, in which the abscissa and the ordinate respectively represent the time base and the pitch frequency. In FIG. 3, curve 5 indicates the mode of variation in the pitch frequency of the natural speech sound corresponding to a word to be synthesized, while curve 6 shows the mode of variation in the pitch frequency of the monosyllable speech sound corresponding to the curve 1 in FIG. 2. The abscissa is divided into the pronunciation intervals t1, t2, t3, ... of the monosyllable sounds. Accordingly, in order that the speech sound having a pitch frequency characteristic corresponding to the curve 5 may be composed of speech segments obtained from the natural voice having a pitch frequency characteristic corresponding to the curve 6, the length of the waveform of each speech segment, i.e. the pitch period, has to be controlled. Therefore, if the waveforms of the speech segments having pitch periods T1, T2, T3, ... as in FIG. 1 are connected and synthesized into connected speech having pitch periods longer or shorter than those periods T1, T2, T3 and T4, then discontinuities 7 are formed at the junctions of the respective speech segment waveforms as shown in FIGS. 4A and 4B. FIG. 4A corresponds to the case where the synthesized speech has a pitch frequency higher than that of the original natural voice from which the speech segment waveforms are obtained, and thus a pitch period shorter than that of the natural voice. FIG. 4B, on the other hand, corresponds to the case where the synthesized speech has a pitch frequency lower than that of the original natural voice and a pitch period longer than that of the original natural voice. The resulting discontinuities 7 deteriorate the vocal quality of the synthesized speech and also generate hoarse noises.
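The junction discontinuity can be illustrated numerically. The following is a minimal sketch and not part of the specification: it assumes a toy speech segment made of a single sine cycle of 80 samples, and the helper names (pitch_segment, connect, max_junction_jump) are hypothetical. It models only the shortened-period case of FIG. 4A, in which fewer samples are read out of each segment than were recorded.

```python
import math

def pitch_segment(n_samples):
    """One pitch period of a toy periodic waveform (a single sine cycle)."""
    return [math.sin(2 * math.pi * n / n_samples) for n in range(n_samples)]

def connect(segment, read_len, repeats):
    """Read the first read_len samples of the segment, repeats times in a row."""
    return segment[:read_len] * repeats

def max_junction_jump(wave, read_len):
    """Largest sample-to-sample change found at the junctions between segments."""
    return max(abs(wave[i] - wave[i - 1])
               for i in range(read_len, len(wave), read_len))

segment = pitch_segment(80)           # assumed natural pitch length: 80 samples
natural = connect(segment, 80, 3)     # read out at the original pitch period
raised = connect(segment, 60, 3)      # shorter period, higher pitch (FIG. 4A case)

print(max_junction_jump(natural, 80))  # small: the waveform closes on itself
print(max_junction_jump(raised, 60))   # large: abrupt amplitude change at each junction
```

With the natural period the last sample of one segment leads smoothly into the first sample of the next, whereas the truncated read-out produces a jump close to the full amplitude at every junction.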

In order to eliminate the influence of the discontinuities, a special treatment of the waveforms would otherwise have to be introduced. According to the present invention, the degradation of the vocal quality due to the discontinuities can be prevented because the method of pitch control in the speech synthesis system is improved, and moreover a system can be realized in which the pitch control is further simplified by making the best use of the merits of a speech synthesizing system in which speech segments are connected to form synthesized speech.

Namely, as shown in FIG. 5, the pitch frequency or the pitch period of the synthesized speech is changed stepwise at intervals of a quarter of a syllable to a full syllable. It has been empirically verified that synthesized speech having a pitch frequency characteristic corresponding to the stair curve 8, indicated by the dotted line in FIG. 5, has a vocal quality superior to that of speech having the pitch frequency characteristic indicated by the solid curve 5 in FIG. 5. In this case, it is unnecessary to perform the pitch control for every speech segment, and since the pitch periods of the successive speech segments within one step are all the same, the pitch control system of the speech synthesis system is simplified.
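The stepwise characteristic of FIG. 5 can be sketched as follows. This is an illustration only: the specification does not say how the level of each step is chosen, so taking the mean of the target contour over the step is an assumption of this sketch, and the function name staircase and the sample values are hypothetical.

```python
def staircase(contour, steps):
    """Replace a per-segment pitch contour of one syllable by `steps` flat steps.

    contour: target pitch frequencies, one value per speech segment.
    steps:   number of constant-pitch steps within the syllable.
    The level of each step is the mean of the contour over that step
    (an assumption of this sketch, not taken from the specification).
    """
    n = len(contour)
    if not 1 <= steps <= n:
        raise ValueError("steps must be between 1 and the number of segments")
    out = []
    for k in range(steps):
        lo = k * n // steps
        hi = (k + 1) * n // steps
        chunk = contour[lo:hi]
        level = sum(chunk) / len(chunk)
        out.extend([level] * len(chunk))
    return out

target = [100, 110, 120, 130, 140, 150]   # rising contour, in Hz (made-up values)
stepped = staircase(target, 3)
print(stepped)   # [105.0, 105.0, 125.0, 125.0, 145.0, 145.0]
```

Within each step every segment is read out with the same pitch period, so only one pitch value per step has to be supplied to the synthesizer.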

In the following, the present invention will be described by way of a preferred embodiment.

FIG. 6 is a block diagram of a concrete structure of a speech synthesis system embodying the present invention.

First, a speech segment memory 32 is described for the sake of convenience. In the memory 32, the speech sound waveforms of all the syllables necessary for the speech synthesis are stored in a high speed memory device such as a core memory. Each syllable in the memory consists of time-sequentially arranged speech segments constituting a waveform as shown in FIG. 1, and the waveform of each speech segment has an address allotted to indicate its location in the memory. Within a monosyllable, serial numbers are allotted to the addresses of the speech segment waveforms arranged in time sequence. Therefore, the first address is used as a syllable address to represent the syllable.

Each speech segment waveform is obtained by sampling the speech sound waveform shown in FIG. 1 at 8 kHz, and each sampled value is coded into an 8-bit signal. The period over which one speech segment, i.e. the wave portion within T1, T2, T3 or T4 in FIG. 1, is recorded is 10 to 20 msec. Namely, the period is set equal to the maximum of the pitch periods of the speech sounds to be synthesized.
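The figures above imply the following arithmetic, shown here as a small worked example. The sampling rate, code length and 10 to 20 msec range are taken from the text; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000     # sampling rate stated in the text
BITS_PER_SAMPLE = 8       # code length stated in the text

def samples_in(period_ms):
    """Number of samples covering a period given in milliseconds."""
    return SAMPLE_RATE_HZ * period_ms // 1000

def segment_bits(period_ms):
    """Storage needed for one recorded segment, in bits."""
    return samples_in(period_ms) * BITS_PER_SAMPLE

def pitch_frequency_hz(samples_read):
    """Pitch frequency obtained when samples_read samples are read per segment."""
    return SAMPLE_RATE_HZ / samples_read

print(samples_in(10), samples_in(20))                     # 80 160
print(segment_bits(20))                                   # 1280
print(pitch_frequency_hz(160), pitch_frequency_hz(80))    # 50.0 100.0
```

A segment recorded over 20 msec thus holds 160 samples, and reading out only the first 80 of them per segment would give a pitch frequency of 100 Hz.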

A series of code signals, each representing one syllable, which constitute the speech to be synthesized are received at a terminal 9 and fed through an input-output control circuit 10 to a data processing circuit 11. For example, code signals corresponding to the syllables YO, KO, HA and MA, constituting the name of a famous port city of Japan, are applied to the circuit 11. The device which generates such code signals is not within the scope of this invention and is not shown in the figure, but the device is equivalent to a conventional automatic response system, being designed to form data for answers to preset questions and to connect the code signals according to the arrangement of words corresponding to those answers.

The data processing circuit 11 interprets the code signals according to a predetermined program and generates signals instructing and controlling the operations of the respective parts of the speech synthesizing apparatus described later.

The operation of the circuit 11 will be described in further detail. On the basis of the series of code signals, the circuit 11 generates speech segment information, pitch information and syllable time information according to a reference table.

The speech segment information is, for example, the address of the first speech segment of a syllable stored in the speech segment memory 32 described above; the pitch information is the information indicated by the dotted curve 8 in FIG. 5, that is, the number indicating how many samples, counted from the first one, of each speech segment stored in the memory 32 are to be read out; and the syllable time information is the time information representing the syllable intervals (t1 and so on) in FIG. 5, that is, the number of samples to be read out within the time of one syllable.

The data processing circuit which performs such processing as described above may be designed especially for the present invention, but a general purpose computer can be used as such a circuit, so that the details thereof are omitted.

The three kinds of information are respectively stored as time-sequential signals in a syllable address buffer memory 14, a pitch time buffer memory 15 and a syllable time buffer memory 16 of a speech synthesizing apparatus 13. The speech synthesizing apparatus 13 consists of a part which selects the speech segments necessary to synthesize connected speech according to the speech segment information, a part which determines the pitch periods of the speech segments according to the pitch information, and a part which determines the times allotted to syllables according to the syllable time information.
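How the three buffer memories might be filled from a series of syllable codes can be sketched as below. This is a hedged illustration: the contents of the reference table (addresses, sample counts, syllable times) are invented for the example, and the names REFERENCE_TABLE and fill_buffers are hypothetical; only the division into three time-sequential queues follows the text.

```python
from collections import deque

# Hypothetical reference table: syllable code ->
# (address of first segment, pitch time per step in samples, syllable time in samples).
REFERENCE_TABLE = {
    "YO": (0,  [80, 76, 72], 1200),
    "KO": (16, [72, 72, 76], 1200),
    "HA": (32, [76, 80, 80], 1200),
    "MA": (48, [80, 84, 88], 1200),
}

def fill_buffers(codes):
    """Return the three queues standing in for buffer memories 14, 15 and 16."""
    address_buf = deque()   # syllable address buffer memory 14
    pitch_buf = deque()     # pitch time buffer memory 15
    time_buf = deque()      # syllable time buffer memory 16
    for code in codes:
        address, pitch_steps, syllable_time = REFERENCE_TABLE[code]
        address_buf.append(address)
        pitch_buf.extend(pitch_steps)
        time_buf.append(syllable_time)
    return address_buf, pitch_buf, time_buf

address_buf, pitch_buf, time_buf = fill_buffers(["YO", "KO", "HA", "MA"])
print(list(address_buf))   # [0, 16, 32, 48]
print(len(pitch_buf))      # 12
```

Each queue is consumed from the front, which mirrors the shift-register behaviour attributed to the buffer memories.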

Next, the operations of the respective components of the speech synthesizing apparatus 13 will be described.

The address data of the syllable address buffer memory 14 are transferred one by one to a segment address memory 17 in response to an external signal, and simultaneously the data in the syllable address buffer memory 14 are shifted forward so that the address of the next syllable comes to the head position. Namely, the memory 14 and the memory 17 may be considered to form a shift register. Likewise, the combination of the pitch time buffer memory 15 and a pitch time memory 18, or of the syllable time buffer memory 16 and a syllable time memory 19, may also be considered to form a shift register.

With the circuit arrangement described above, the address signal of the first speech segment of a syllable stored in the segment address memory 17 is applied to a read out circuit 29, so that the series of sampled values constituting the segment are sequentially read out in synchronism with clock pulses from a clock signal generator 20. The number of samples read out is detected by counting the clock pulses with a pitch counter 22. When the content of the pitch counter 22 coincides with the pitch time data set in the pitch time memory 18, a coincidence circuit 25 detects the instant of coincidence and delivers a coincidence pulse. The coincidence pulse serves not only to reset the pitch counter 22 but also to advance a segment address counter 21 step by step. The output of the advanced segment address counter 21 is applied to the segment address memory 17 to read out the next speech segment from the speech segment memory 32 in the same manner as described above. Thereafter, the same operation of reading out the sampled values is repeated. The coincidence pulse also resets the counter 23 at the same time.

On the other hand, the time counter 23 also counts the clock pulses, and when the content of the time counter 23 coincides with the syllable time data (that is, the number of sampling points occurring within the time during which the pitch frequency in one syllable remains the same, as described above) set in the syllable time memory 19, a coincidence circuit 26 detects the instant of coincidence and delivers a coincidence pulse at that instant.

This coincidence pulse serves not only to transfer the foremost pitch time data of the pitch time buffer memory 15 to the pitch time memory 18, but also to advance a syllable counter 24 step by step. When the content of the syllable counter 24 coincides with the step number recorded in a step number memory 28, a coincidence circuit 27 detects the instant of coincidence and delivers a coincidence pulse. This coincidence pulse resets the counter 24 and is also applied to the syllable address buffer memory 14 and the syllable time buffer memory 16, so that the control information for the syllable to be synthesized next, i.e. the segment address and the time data for the syllable, is transferred respectively to the memory 17 and the memory 19. The step number stored in the step number memory refers to the number of steps occurring within the time of one syllable when the pitch frequency is changed stepwise as shown in FIG. 5. In the case of FIG. 5, the number of steps is three. As has been revealed by the inventor's experiments, it is when the number of steps is three that the deterioration of the vocal quality of the synthesized speech due to the waveform discontinuities is reduced to the minimum. However, the number of steps need not necessarily be limited to 3 but may be 4 to 1, that is, the pitch frequency of the synthesized speech sounds may be varied at intervals of a quarter of a syllable to a full syllable.
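The interplay of the counters and coincidence circuits can be followed in a small software simulation. It is a sketch under stated assumptions rather than a description of the actual circuit: the function name synthesize and the toy memory contents are hypothetical, the time counter is assumed to be cleared by its own coincidence circuit 26 so that it measures a full step interval, and the outer loops stand in for coincidence circuit 27 and the buffer memories.

```python
def synthesize(memory, syllables, step_number):
    """Simulate the read-out loop of the speech synthesizing apparatus.

    memory:      list of segments (lists of samples) indexed by segment address.
    syllables:   list of (first_address, pitch_times, step_time) tuples, where
                 pitch_times holds one samples-per-segment value for each step.
    step_number: number of constant-pitch steps per syllable.
    """
    out = []
    for first_address, pitch_times, step_time in syllables:
        segment_address = first_address      # segment address memory 17 / counter 21
        pitch_queue = list(pitch_times)      # pitch time buffer memory 15
        pitch_time = pitch_queue.pop(0)      # pitch time memory 18
        pitch_counter = 0                    # pitch counter 22
        time_counter = 0                     # time counter 23
        syllable_counter = 0                 # syllable counter 24
        while syllable_counter < step_number:
            out.append(memory[segment_address][pitch_counter])  # read out circuit 29
            pitch_counter += 1
            time_counter += 1
            # Coincidence circuit 25. ">=" rather than strict equality keeps the
            # sketch safe if the pitch time shrinks in the middle of a segment.
            if pitch_counter >= pitch_time:
                pitch_counter = 0
                segment_address += 1
            # Coincidence circuit 26: one constant-pitch step has elapsed.
            if time_counter == step_time:
                time_counter = 0             # assumption of this sketch
                syllable_counter += 1
                if pitch_queue:
                    pitch_time = pitch_queue.pop(0)
    return out

# Toy memory: 32 segments of 6 samples, each sample tagged (address, index).
memory = [[(addr, n) for n in range(6)] for addr in range(32)]
syllables = [(0, [4, 5, 4], 20), (16, [5, 5, 4], 20)]
out = synthesize(memory, syllables, 3)
print(len(out))            # 120
print(out[3], out[4])      # (0, 3) (1, 0)
```

In the example the first syllable is read with 4 samples per segment for the first step, 5 for the second and 4 for the third, so the pitch period changes only at step boundaries and not at every segment.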

The output signal obtained from the read out circuit 29 as a result of the operations described above is equivalent to a signal obtained by subjecting the signal waveform shown in FIG. 4A or 4B to pulse code modulation, since the speech synthesizing apparatus 13 consists of digital circuits. The signal is then converted to an analog signal by a digital-to-analog converter 30, and the analog signal is finally converted to a speech sound signal or audible voice by an electroacoustic transducer 31. In this case, the digital-to-analog converter 30 and the electroacoustic transducer 31 are connected by a transmission line such as a telephone line, which electrically connects a remote subscriber with the central information service system.

The speech synthesis system shown in FIG. 6 has been described as applied to the case where the speech sounds for only one channel are synthesized. It is, however, a matter of course that, since the whole system is composed of digital signal processing circuits and the speech segments are stored in a memory capable of high speed access, such as a core memory, the system can easily be constructed in a multichannel arrangement as known in the art.

Namely, such a multichannel arrangement can be realized if the input-output control circuit 10, the data processing circuit 11 and the speech segment memory 32 are used in common and if the number of speech synthesizing apparatuses 13 is increased according to the number of channels required.

Moreover, the speech segments stored in the speech segment memory may be obtained by directly extracting the components from the natural human voice or by artificially treating the waveforms of human speech sounds.

I claim:

1. A speech synthesis system comprising:

a speech segment memory which stores a plurality of speech segments;

a synthesizing apparatus, coupled to said speech segment memory, including first means for selecting desired speech segments from said speech segment memory,

second means, coupled to said first means, for controlling the pitch period of each of said desired speech segments, so as to change the pitch frequency of the synthesized speech stepwise, and

third means, coupled to said first means and said second means, for connecting the desired pitch controlled speech segments together.

2. A speech synthesis system according to claim 1, wherein said second means includes means for adjusting the intervals of the stepwise change between a quarter of a syllable and a full syllable.

3. A speech synthesis system comprising:

a. a data processing circuit to convert code signals representative of the syllables of words to be synthesized into control signals for speech synthesis;

b. a speech segment memory to store speech segments each having a waveform of about a pitch length;

c. a speech synthesizing apparatus, coupled to said data processing circuit and to said speech segment memory, and including first means for selecting desired speech segments in said speech segment memory, second means, coupled to said first means, for controlling the read time of each of the selected speech segments, so as to change the pitch period of the synthesized speech stepwise at intervals of a quarter of a syllable to a full syllable, and

third means, coupled to said first means and to said second means, for synthesizing speech sound waveform signals by connecting the selected and pitch controlled speech segment signals together, in response to the control signals from the data processing circuit; and

d. an electro-acoustic converting device, coupled to said speech segment memory and said speech synthesizing apparatus, for converting the speech sound waveform signals into corresponding speech sounds.
